-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2417: Add support for geometry logical type #2971
base: master
Are you sure you want to change the base?
PARQUET-2417: Add support for geometry logical type #2971
Conversation
This PR is copied form this place: apache#1379
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java
Outdated
Show resolved
Hide resolved
...uet-column/src/main/java/org/apache/parquet/column/statistics/geometry/EnvelopeCovering.java
Outdated
Show resolved
Hide resolved
...uet-column/src/main/java/org/apache/parquet/column/statistics/geometry/EnvelopeCovering.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java
Outdated
Show resolved
Hide resolved
…e spherical edge is specified.
…apache-parquet-2417-geospatial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update! I have left some comments. I think we are reaching the finish line!
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/GeometryUtils.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/Covering.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/test/java/org/apache/parquet/statistics/TestGeometryTypeRoundTrip.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
@Test | ||
public void testEPSG4326BasicReadWriteGeometryValue() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding these tests!
I think we are missing tests in following cases:
- verify geometry type metadata is well preserved.
- verify all kinds of geometry stats are preserved, including bbox, covering and geometry types.
- verify geo stats in the column index have been generated.
I can do these later.
…apache-parquet-2417-geospatial
…apache-parquet-2417-geospatial
@wgtmac please take a look :-) |
Sure, I will take a look. Thanks! |
I am depending on this PR to build geo support for iceberg. I got lots of test failures when building this branch locally:
NPE is thrown when reading parquet files without geo columns. Can we apply the following patch to resolve this problem? diff --git a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java b/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
index 3efc9345..22e51783 100644
--- a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
+++ b/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
@@ -961,6 +961,9 @@ public class ParquetMetadataConverter {
static org.apache.parquet.column.statistics.geometry.GeospatialStatistics fromParquetStatistics(
GeospatialStatistics formatGeomStats, PrimitiveType type) {
+ if (formatGeomStats == null) {
+ return null;
+ }
org.apache.parquet.column.statistics.geometry.BoundingBox bbox = null;
if (formatGeomStats.isSetBbox()) {
BoundingBox formatBbox = formatGeomStats.getBbox(); |
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update! Please see my inline comments.
@@ -30,6 +32,7 @@ public class BinaryStatistics extends Statistics<Binary> { | |||
|
|||
private Binary max; | |||
private Binary min; | |||
private GeospatialStatistics geospatialStatistics = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to leave BinaryStatistics
as is without any change. This has been discussed in the spec PR that GeospatialStatistics
should be similar to the implementation of SizeStatistics
. Users can directly call ColumnChunkMetaData.getGeospatialStatitics()
to get it. Please revert all changes in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take a look at ColumnValueCollector
class and follow the same approach of collecting SizeStatistics
for GeospatialStatistics
.
@@ -178,6 +183,30 @@ public Statistics<?> build() { | |||
} | |||
} | |||
|
|||
// Builder for GEOMETRY type to handle GeometryStatistics | |||
private static class GeometryBuilder extends Builder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my comment in BinaryStatistics.java, we need to revert all changes in this file as well.
xMin = minX; | ||
xMax = maxX; | ||
} else { | ||
// Handle the wraparound case for X values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to check if it is geography
type first or use a subclass for bbox of geography type? IIUC, geometry
type does not need to check wraparound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to check wraparound for geometry type.
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/GeometryTypes.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java
Outdated
Show resolved
Hide resolved
// calculate the geometryStatistics from the BinaryStatistics | ||
GeospatialStatistics geospatialStatistics = null; | ||
if (currentStatistics instanceof BinaryStatistics) | ||
geospatialStatistics = ((BinaryStatistics) currentStatistics).getGeospatialStatistics(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please refactor the change to be similar to currentSizeStatistics
.
} | ||
} | ||
|
||
class GeometryColumnChunkMetaData extends ColumnChunkMetaData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not create another ColumnChunkMetaData subclass. Please follow what we did for SizeStatistics
.
This PR is to provide a POC to support the proposed changes to the parquet-format to add geometry type to parquet.
Here is the proposal: apache/parquet-format#240
Jira
Tests
Commits
Documentation